FOIL it! Find One mismatch between Image and Language caption
Abstract
In this paper, we aim to understand whether current language and vision (LaVi) models truly grasp the interaction between the two modalities. To this end, we propose an extension of the MSCOCO dataset, FOIL-COCO, which associates images with both correct and ‘foil’ captions, that is, descriptions of the image that are highly similar to the original ones, but contain one single mistake (‘foil word’). We show that current LaVi models fall into the traps of this data and perform badly on three tasks: a) caption classification (correct vs. foil); b) foil word detection; c) foil word correction. Humans, in contrast, have near-perfect performance on those tasks. We demonstrate that merely utilising language cues is not enough to model FOIL-COCO and that it challenges the state-of-the-art by requiring a fine-grained understanding of the relation between text and image.
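To make the data format concrete, here is a minimal sketch of a correct/foil caption pair and of what the foil word detection target looks like. The function names (`make_foil`, `find_foil_word`) and the example caption are hypothetical and are not taken from FOIL-COCO. The abstract does not say how foil words are chosen, so the replacement word is simply passed in by hand. Note also that this sketch locates the foil word by comparing against the correct caption, which only illustrates the labels; a model evaluated on the tasks would have to rely on the image instead.

```python
from typing import Optional, Tuple


def make_foil(caption: str, target: str, replacement: str) -> str:
    """Return a copy of `caption` with the first occurrence of `target` replaced.

    Assumption: captions are whitespace-tokenised and exactly one word is swapped.
    """
    words = caption.split()
    if target not in words:
        raise ValueError(f"{target!r} does not occur in the caption")
    words[words.index(target)] = replacement
    return " ".join(words)


def find_foil_word(correct: str, candidate: str) -> Optional[Tuple[int, str]]:
    """Return (position, word) of the single mismatch, or None if the captions match.

    This uses the correct caption as a reference, so it only illustrates the
    expected label, not how a language-and-vision model would find it.
    """
    a, b = correct.split(), candidate.split()
    if len(a) != len(b):
        raise ValueError("captions must have the same number of words")
    diffs = [(i, w) for i, (v, w) in enumerate(zip(a, b)) if v != w]
    if not diffs:
        return None
    if len(diffs) > 1:
        raise ValueError("expected at most one mismatching word")
    return diffs[0]


# Hypothetical example pair (not from the dataset).
correct_caption = "a man rides a bicycle"
foil_caption = make_foil(correct_caption, "bicycle", "motorcycle")

# Task (a): the classification label for each caption.
labels = {correct_caption: "correct", foil_caption: "foil"}

# Task (b): the word a detector should flag; task (c): the word it should restore.
position, foil_word = find_foil_word(correct_caption, foil_caption)
corrected_word = correct_caption.split()[position]
```

In this toy pair the two captions differ in a single position, which is the property the abstract describes: the foil stays highly similar to the original apart from that one word.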
Similar resources
Investigating the Impact of Concept-Based Image Indexing on Image Retrieval Using the Google Search Engine
Purpose: The purpose of the present study is to investigate the impact of concept-based image indexing on image retrieval via Google. Given the importance of images, this article focuses on the features that Google takes into account when retrieving them. Methodology: The present study is applied research, and its research method comes from quasi-experimental and techno...
Effects of Closed-caption Programs on EFL Learners’ Listening Comprehension and Vocabulary Learning
This study aimed to investigate the impact of closed-caption programs on listening comprehension of English movies and on vocabulary learning. Sixty-four graduate students studying at Shiraz Islamic Azad University were selected as the participants of the study. The participants were divided into two groups: experimental group (with closed caption program) and control group (without closed captio...
Studying the Effectiveness of Security Images in Internet Banking
Security images are often used as part of the login process on internet banking websites, under the theory that they can help foil phishing attacks. Previous studies, however, have yielded inconsistent results about users’ ability to notice that a security image is missing and their willingness to log in even when the expected security image is absent. This paper describes an online study of 48...
Leveraging Visual Question Answering for Image-Caption Ranking
Visual Question Answering (VQA) is the task of taking as input an image and a free-form natural language question about the image, and producing an accurate answer. In this work we view VQA as a “feature extraction” module to extract image and caption representations. We employ these representations for the task of image-caption ranking. Each feature dimension captures (imagines) whether a fact...
Names and Faces
We show that a large and realistic face dataset can be built from news photographs and their associated captions. Our dataset consists of 44,773 face images, obtained by applying a face finder to approximately half a million captioned news images. This dataset is more realistic than usual face recognition datasets, because it contains faces captured “in the wild” in a variety of configurations ...